Efficient Software Synchronization on Large Cache Coherent Multiprocessors
Authors
Abstract
Large-scale shared-memory multiprocessors typically have long latencies for remote data accesses. A key issue for execution performance of many common applications is the synchronization cost. The communication scalability of synchronization has been improved by the introduction of queue-based spin-locks instead of Test&(Test&Set). For architectures with long access latencies for global data, attention should also be paid to the number of global accesses that are involved in synchronization. We present a method to characterize the performance of proposed queue lock algorithms, and apply it to previously published algorithms. We also present two new queue locks, the LH lock and the M lock. We compare the locks in terms of performance, memory requirements, code size, and required hardware support. The LH lock is the simplest of all the locks, yet requires only an atomic swap operation. The M lock is superior in terms of global accesses needed to perform synchronization and still competitive in all other criteria. We conclude that the M lock is the best overall queue lock for the class of architectures studied.
Currently with Sun Microsystems, work done while at SICS.
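To make the idea of a queue-based spin-lock that needs only an atomic swap more concrete, here is a minimal C++ sketch. It is not taken from the paper: the abstract does not give the LH or M lock algorithms, so the names `SwapQueueLock`, `Handle`, `Node`, and `run_counter_demo` are hypothetical, and the memory orderings are one reasonable choice rather than the authors' specification. The sketch assumes the general scheme in which each waiter swaps its own cell into a shared tail pointer and then spins only on the cell of its predecessor, so that waiting threads do not all poll one global location.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// One queue cell. 'locked' stays true while its current owner holds
// or is waiting for the lock. (Hypothetical name, for illustration.)
struct Node {
    std::atomic<bool> locked{false};
};

// Sketch of a queue lock whose only atomic read-modify-write is a swap.
class SwapQueueLock {
public:
    // Per-thread state: the cell this thread enqueues, and its predecessor.
    struct Handle {
        Node* mine = new Node;
        Node* pred = nullptr;
        Handle() = default;
        Handle(const Handle&) = delete;
        Handle& operator=(const Handle&) = delete;
        ~Handle() { delete mine; }
    };

    // The queue starts with one unlocked dummy cell.
    SwapQueueLock() : tail_(new Node) {}
    ~SwapQueueLock() { delete tail_.load(); }

    void lock(Handle& h) {
        h.mine->locked.store(true, std::memory_order_relaxed);
        // Atomic swap: append our cell and learn who is ahead of us.
        h.pred = tail_.exchange(h.mine, std::memory_order_acq_rel);
        // Spin only on the predecessor's cell.
        while (h.pred->locked.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }

    void unlock(Handle& h) {
        Node* released = h.mine;
        h.mine = h.pred;  // recycle the predecessor's cell for our next acquire
        released->locked.store(false, std::memory_order_release);
    }

private:
    std::atomic<Node*> tail_;
};

// Demo helper (hypothetical): several threads increment a plain counter
// that is protected only by the lock; returns the final count.
long run_counter_demo(int threads, int iters) {
    SwapQueueLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            SwapQueueLock::Handle h;
            for (int i = 0; i < iters; ++i) {
                lock.lock(h);
                ++counter;
                lock.unlock(h);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

In this sketch an uncontended acquire touches shared state with one swap on the tail pointer and one read of the predecessor's cell, and a release is a single store. Counting accesses of this kind is the sort of comparison the abstract describes, although the figures for the published locks have to come from the paper itself.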
Similar Resources
Software Caching on Cache-Coherent Multiprocessors
Programmers have always been concerned with data distribution and remote memory access costs on shared-memory multiprocessors that lack coherent caches, like the BBN Butterfly. Recently memory latency has become an important issue on cache-coherent multiprocessors, where dramatic improvements in microprocessor performance have increased the relative cost of cache misses and coherency transaction...
Executing Nested Parallel Loops on Shared-Memory Multiprocessors Sadun
Cache-coherent, bus-based shared-memory multiprocessors are a cost-effective platform for parallel processing. In scientific parallel applications, most of the computation involves processing of large multidimensional data structures which results in a high degree of data parallelism. This parallelism can be exploited in the form of nested parallel loops. Most existing shared memory multiprocesso...
Fast Synchronization on Scalable Cache-Coherent Multiprocessors using Hybrid Primitives
This paper presents a new methodology for implementing fast synchronization on scalable cache-coherent multiprocessors, through the use of hybrid primitives. Hybrid primitives leverage commodity hardware to speed-up the execution of the atomic remote Read-Modify-Write (RMW) instructions employed in synchronization algorithms to resolve contending processors, while exploiting the caches to reduc...
A Preliminary Evaluation of Cache-miss-initiated Prefetching Techniques in Scalable Multiprocessors
Prefetching is an important technique for reducing the average latency of memory accesses in scalable cache-coherent multiprocessors. Aggressive prefetching can significantly reduce the number of cache misses, but may introduce bursty network and memory traffic, and increase data sharing and cache pollution. Given that we anticipate enormous increases in both network bandwidth and latency, we exam...
Shared Virtual Memory Clusters with Next-Generation Interconnection Networks and Wide Compute Nodes
Recently much effort has been spent on providing a shared address space abstraction on clusters of small–scale symmetric multiprocessors. However, advances in technology will soon make it possible to construct these clusters with larger–scale cc-NUMA nodes, connected with non-coherent networks that offer latencies and bandwidth comparable to interconnection networks used in hardware cache–coher...